fix(geometry): retry point-in-polygon with finer batching on GPU OOM #62

Open
EliHei2 wants to merge 1 commit into main from bugfix/spatial-join-oom

EliHei2 commented Jun 1, 2026

Collaborator

The batched quadtree point-in-polygon join could OOM on very large inputs (notably MERSCOPE, with millions of transcripts) and crash the run. Wrap the batch loop so a CUDA out-of-memory error retries with progressively finer batching (doubling the batch count up to 256) before giving up, and return an empty match frame when there are no results instead of failing in cudf.concat.

What to review: the OOM-detection (errors only; non-OOM re-raised immediately) and the bounded retry in _points_in_polygons_contains. Keeps main's quadtree API; no change to the happy path.
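
The retry described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `run_with_finer_batches`, `looks_like_gpu_oom`, and the string heuristics used to recognise an OOM are hypothetical names and assumptions; only the doubling and the cap of 256 batches come from the description.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")

MAX_BATCHES = 256  # upper bound on the batch count, as stated in the description


def looks_like_gpu_oom(exc: BaseException) -> bool:
    """Heuristic (assumption): MemoryError or a typical OOM message counts as OOM."""
    if isinstance(exc, MemoryError):
        return True
    msg = str(exc).lower()
    return "out of memory" in msg or "out_of_memory" in msg or "bad_alloc" in msg


def run_with_finer_batches(
    items: Sequence[T],
    process_batch: Callable[[Sequence[T]], R],
    n_batches: int = 1,
    max_batches: int = MAX_BATCHES,
) -> List[R]:
    """Process items in n_batches chunks; on OOM, double n_batches and start over."""
    while True:
        try:
            # Ceiling division so every item lands in some batch.
            size = max(1, -(-len(items) // n_batches))
            return [
                process_batch(items[i : i + size])
                for i in range(0, len(items), size)
            ]
        except Exception as exc:
            # Non-OOM errors propagate immediately; OOM propagates once the cap is hit.
            if not looks_like_gpu_oom(exc) or n_batches >= max_batches:
                raise
            n_batches = min(n_batches * 2, max_batches)
```

An empty input yields an empty list here, which mirrors returning an empty match frame instead of concatenating nothing.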


Tobiaspk commented Jun 8, 2026

Collaborator

Good catch, down to merge if it makes things more robust. Two things first:

  1. Do you have a log that shows that points-in-polygon is really the point of failure? The OOMs I've seen on our SLURM cluster often show up as oom_kill events or C++ std::bad_alloc (which is a RuntimeError), and those happen outside points-in-polygon for the 1B transcript Atera dataset. As a side note, I don't think we could even catch an oom_kill here, since it is a Slurm SIGKILL. Our out-of-memory logs typically look like:
/var/spool/slurmd/job11412228/slurm_script: line 21: 4167901 Killed                  segger segment --input-directory /data1/collab002/sail/projects/ongoing/segger_dev/data/inputs/WTA_Preview_FFPE_Breast_Cancer --output-directory /data1/collab002/sail/projects/ongoing/segger_dev/data/outputs/WTA_Preview_FFPE_Breast_Cancer_v3 --tiling-margin-training 5.0 --tiling-margin-prediction 5.0 --debug
[2026-05-27T20:15:16.212] error: Detected 1 oom_kill event in StepId=11412228.batch. Some of the step tasks have been OOM Killed.

Could you share a log showing it crash in points-in-polygon on your end?

  2. Going forward, similar to fix(tiling): fall back to a smaller margin instead of dropping tiles #61, let's try to estimate the max batch size up front rather than rely on error catching (which admittedly will be hard for GPU). Alternatively, we could try configuring the RMM and CuPy allocators to avoid this out of the box.

Happy to merge in the meantime, but I would like to confirm it's really points-in-polygon first.
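
For the up-front estimate suggested in point 2, a rough sketch could look like the following. Everything here is an assumption for illustration: `estimate_n_batches` is a hypothetical helper, the per-point memory cost would have to be measured for the actual join, and the safety fraction is arbitrary. Free device memory could be obtained from something like `cupy.cuda.runtime.memGetInfo()`, which is deliberately not called here so the snippet runs without a GPU.

```python
import math


def estimate_n_batches(
    n_points: int,
    free_bytes: int,
    bytes_per_point: int,
    safety: float = 0.5,
) -> int:
    """Estimate how many batches keep each batch within a share of free memory.

    bytes_per_point is an assumed, empirically measured peak cost per point;
    safety is the fraction of free memory one batch is allowed to use.
    """
    if n_points <= 0:
        return 1
    budget = int(free_bytes * safety)
    # At least one point per batch, even if the budget is tiny.
    points_per_batch = max(1, budget // bytes_per_point)
    return math.ceil(n_points / points_per_batch)


# Example: 1B points, 40 GiB free, an assumed 64 bytes of peak memory per point.
n = estimate_n_batches(1_000_000_000, 40 * 1024**3, 64)
```

Such an estimate would only set the starting batch count; a retry on OOM could still serve as a fallback if the per-point cost is underestimated.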
